Guided parallelized stochastic gradient descent for delay compensation
Authors
Abstract
Stochastic gradient descent (SGD) and its variations have been used effectively to optimize neural network models. However, with the rapid growth of big data and deep learning, SGD is no longer the most suitable choice because it optimizes the error function sequentially. This has led to the development of parallel SGD algorithms, such as asynchronous SGD (ASGD) and synchronous SGD (SSGD), to train deep neural networks. However, parallelization introduces high variance due to the delay in parameter (weight) updates. We address this delay in our proposed algorithm and try to minimize its impact. We employ guided SGD (gSGD), which encourages consistent examples to steer the convergence by compensating for the unpredictable deviation caused by the delay. Its convergence rate is similar to that of A/SSGD; however, some additional (parallel) processing is required to compensate for the delay. The experimental results demonstrate that the proposed approach is able to mitigate the impact of delay on classification accuracy. The guided approach with SSGD clearly outperforms [...] and even achieves accuracy close to [...] on benchmark datasets.
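As a rough illustration of the synchronous parallel pattern the abstract mentions, the following sketch averages per-worker gradients before each update. It is not the paper's gSGD method: the one-parameter least-squares model, the names `grad_mse` and `ssgd`, the toy shards, and the learning rate are all assumptions chosen for illustration.

```python
# Minimal sketch of synchronous SGD (SSGD) on a 1-D least-squares problem.
# Hypothetical setup: each "worker" owns one shard of (x, y) pairs, and all
# workers compute their gradient at the SAME weight before a single update.

def grad_mse(w, shard):
    """Gradient of 0.5 * (w*x - y)^2, averaged over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def ssgd(shards, w=0.0, lr=0.1, steps=100):
    """Run `steps` synchronous updates: average worker gradients, then step."""
    for _ in range(steps):
        grads = [grad_mse(w, shard) for shard in shards]  # no stale weights
        w -= lr * sum(grads) / len(grads)
    return w

# Toy data generated from y = 2x, split across two hypothetical workers.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w_final = ssgd(shards)
print(w_final)
```

Because every worker reads the same weight, no gradient is stale here; the cost is that each step waits for the slowest worker, which is the trade-off that motivates asynchronous variants.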
Related resources
Parallelized Stochastic Gradient Descent
With the increase in available data, parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm, including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms [5, 7], our variant comes with parallel acceleration guarantees and it poses n...
Asynchronous Stochastic Gradient Descent with Delay Compensation
With the fast development of deep learning, people have started to train very big neural networks using massive data. Asynchronous Stochastic Gradient Descent (ASGD) is widely used to fulfill this task, which, however, is known to suffer from the problem of delayed gradient. That is, when a local worker adds the gradient it calculates to the global model, the global model may have been updated ...
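The delayed-gradient problem described above can be sketched with a toy simulation. This is an illustrative assumption, not the algorithm from the paper: the objective f(w) = 0.5 * w^2, the fixed delay, and the name `run_delayed_sgd` are all hypothetical, and no compensation is applied.

```python
# Toy simulation of stale gradients: each update applies the gradient that
# was computed at the weight from `delay` steps earlier.
# Objective (assumed for illustration): f(w) = 0.5 * w**2, so grad f(w) = w.

def run_delayed_sgd(w0, lr, delay, steps):
    """Return the final weight after `steps` updates with a fixed delay."""
    history = [w0]  # history[t] is the weight before update t
    w = w0
    for t in range(steps):
        stale_w = history[max(0, t - delay)]  # weight the worker actually saw
        grad = stale_w                        # gradient of 0.5 * w**2
        w = w - lr * grad                     # applied to the CURRENT weight
        history.append(w)
    return w

fresh = run_delayed_sgd(1.0, 0.5, delay=0, steps=6)
stale = run_delayed_sgd(1.0, 0.5, delay=2, steps=6)
print(fresh, stale)
```

With these settings the undelayed run shrinks steadily toward the minimum at 0, while the delayed run overshoots it and ends farther away, which is the kind of deviation that delay-compensation methods aim to reduce.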
Supplementary Material: Asynchronous Stochastic Gradient Descent with Delay Compensation
where $C_{ij} = \frac{1}{1+\lambda}\left(\frac{u_i u_j \beta}{l_i l_j \sqrt{\alpha}}\right)$, $C'_{ij} = \frac{1}{(1+\lambda)\,\alpha\,(l_i l_j)}$, and the model converges to the optimal model, then the MSE of $\lambda G(w_t)$ is smaller than the MSE of $G(w_t)$ in approximating the Hessian $H(w_t)$. Proof: For simplicity, we abbreviate $E(Y \mid x, w^*)$ as $E$, $G(w_t)$ as $G_t$, and $H(w_t)$ as $H_t$. First, we calculate the MSE of $G_t$ and $\lambda G_t$ in approximating $H_t$ for each element of $G_t$. We denote the element in the i-th r...
Variational Stochastic Gradient Descent
In the Bayesian approach to probabilistic modeling of data, we select a model for the probabilities of data that depends on a continuous vector of parameters. For a given data set, Bayes' theorem gives a probability distribution over the model parameters. The inference of outcomes and probabilities of new data can then be found by averaging over the parameter distribution of the model, which is an intr...
Byzantine Stochastic Gradient Descent
This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the m machines which allegedly compute stochastic gradients every iteration, an α-fraction are Byzantine, and can behave arbitrarily and adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds ε-approximate minimizers of convex functions in T = Õ ( 1...
Journal
Journal title: Applied Soft Computing
Year: 2021
ISSN: 1568-4946, 1872-9681
DOI: https://doi.org/10.1016/j.asoc.2021.107084